Impact of Macroeconomics Elements on the Crime Rate in the USA

1. Introduction


1.1 Overview and motivation

We chose the topic of crime because it has always been topical, and it remains so today given the growing number of crimes against humanity, hate crimes and violent acts around the world. Multiple wars in Europe and the Middle East; dubious governments in Africa; social networks that have become a digital court or a channel for harassment; and religious institutions such as the Church, which has been described as one of the biggest networks of sexual assault and pedophilia: these are just a few signs that give the impression that the situation is deteriorating year after year. But has this always been the case?

The U.S. is the world’s largest economic and military power but is still heavily criticized for its laws governing the right to bear arms, and for racism, poverty, and failing health and justice systems. The United States is no exception to the rule of increasing crime rates. Some cities, such as Houston in the state of Texas, are considered among the most dangerous on American soil: places where crime is out of control and where murders and trafficking of all kinds are carried out almost freely. This worrying case reveals the particular situation of the United States regarding public safety, as stated in the article “Is Crime up or down? In Houston, concerns are hard to allay” from the news agency The Associated Press.

As all three of us have a real interest in this country, making it the focus of our study seemed an obvious choice. The United States is a federal state, which means that different autonomous entities run the country at their own level, so each state has its own constitution and laws. Because of its great political, cultural, and social diversity, the United States is a well-suited subject for our study, which is made all the more thorough by the wealth of available data.

Also, as explained in the article “Stories about crime are rife with misinformation and racism, critics say”, opinions are strongly divided regarding the information relayed by the media about crime. Some say that public opinion is highly manipulated by the government controlling the media, others claim the objectivity of the American media. For this reason, we would like to conduct our own study and better understand crime in the United States.

1.3 Research question

The purpose of this analysis is to estimate the impact of several macroeconomic variables on the crime rate in the United States. For this project we consider only violent crime, namely: homicide, rape, aggravated assault, and robbery. In addition, we consider the following macroeconomic variables: population, poverty rate, foreign-born persons residing in the United States, Police and Sheriff’s Patrol Officers, number of federal firearms licensees, income inequality, and unemployment rate. These variables are reported by state for the year 2021 (2020 for data that have not been updated for the year 2021).

We consider the following five questions as guidelines for our study:

  1. Which states have high crime rates? We want to rank the states according to their crime rates, to identify potential similarities and/or differences, and to create groups that are representative of the entire country.

  2. How have crime rates changed over time? Find out if states with high crime rates have always been high crime states, or if they have become high crime states (if so, understand why).

  3. What are the common variables influencing crime rates? We will identify the variables that most impact the crime rate in each group by listing the most common variables potentially responsible for violent crime.

  4. What impact do these different variables have on the crime rate? We would like to estimate the weight of each variable identified in question 3 on the crime rate.

  5. How can we optimize the impact of these variables on crime? We would like to identify the variables that can realistically be acted upon, in order to propose feasible solutions to decrease the crime rate.

1.4 Data

1.4.1 Where you can get it

Source: https://crime-data-explorer.fr.cloud.gov/pages/explorer/crime/crime-trend

Our first and largest database is from the Federal Bureau of Investigation’s Crime Data Explorer. This data provides us with information on the rate of violent crime, rape, aggravated assault, homicide, and robbery over a 35-year period (1985 to 2020), which allowed us to create two data frames:

1st Data Frame:

| Variable | Meaning |
|---|---|
| Series | Location where the crime(s) was/were committed |
| Rate | Violent crime rate by location and year |
| 1985-2020 | Year considered for the rate calculation |

2nd Data Frame:

| Variable | Meaning |
|---|---|
| Series | Location where the crime(s) was/were committed |
| Type_of_crime | Type of crime committed |
| Rated | Rate calculated according to the type of crime, the location, and the year |
| 1985-2020 | Year considered for the rate calculation |

The FBI notes that its crime statistics are not based on data collected from 100% of law enforcement agencies. In addition, some crimes are never reported to the police or remain unsolved. Therefore, we cannot say that the data fully reflect reality, but for the sake of our study, we will assume that they do.

Regarding the rape data, the data yields two different rates: “Legacy Rape” and “Revised Rape.” In 2013, the FBI began collecting rape data under a revised definition and removed the word “forcible” from the offense name. We then made the decision to keep only the “Revised Rape” data from 2013, since it ultimately encompasses both definitions of rape.

To answer one of our research questions, we need the violent crime rate by state for the year 2021, which we found at the following site. We noticed that the 2021 rates are identical to the 2020 rates for every state. This suggests that Covid has had little or no impact on the violent crime rate, as the article “Federal surveys show no increase in U.S. violent crime rate since start of the pandemic” also indicates. We therefore use the same rates for 2020 and 2021.

Source : https://worldpopulationreview.com/states

This database gives the population of the United States by state in 2021 and 2022. It has 9 variables, but only 4 are of interest to us.

| Variable | Meaning |
|---|---|
| State | State where the population was measured |
| Pop | Population by geographic area in 2022 |
| Pop2021 | Population by geographic area in 2021 |
| Growth | Population growth rate from 2021 to 2022 |

Source : https://www.americanprogress.org/data-view/poverty-data/poverty-data-map-tool/

This database provides the poverty rate, unemployment rate, and income inequality by state for the year 2021. Data collection from the U.S. Census Bureau was interrupted due to the Covid crisis, so the published data are experimental data that we will assume are real for our study.

| Variable | Meaning |
|---|---|
| State | State considered for the calculation of the rate |
| Official poverty rate | Poverty rate by state (in %) |
| Unemployment rate | Unemployment rate by state (in %) |
| Income inequality | Income inequality by state (in %) |

Source : https://www.atf.gov/firearms/docs/report/2021-firearms-commerce-report/download

This database is taken from the Bureau of Alcohol, Tobacco, Firearms and Explosives’ annual report published in 2021 which reports data for the year 2020 (page 22). We do not have data for the year 2021 since the report will not be published until December 2022, but we will consider the same data for the year 2021. We do not have data for guns bought and sold on the black market, so we will only consider legal licenses.

| Variable | Meaning |
|---|---|
| State | State considered for the calculation of the number of firearms licensees |
| FFI. Population | Number of Federal Firearms Licensees |

Source : https://data.bls.gov/oes/#/home

We found this database on the US Bureau of Labor Statistics website with the choices “One occupation for multiple geographic areas” → “Police and Sheriff’s Patrol Officers” → “State” → “All states in this list”. It gives the number of police and sheriff’s patrol officers by state in May 2021. There are 18 variables in this database, but we will focus on 2 of them.

| Variable | Meaning |
|---|---|
| States | State considered for the calculation of the number of employees |
| Employement | Number of employees by state |

Source : https://www.pewresearch.org/hispanic/2020/08/20/facts-on-u-s-immigrants-current-data/

This database (“Nativity of U.S immigrants”) gives us the number of people residing in the United States in 2018 who were born abroad. It has 14 variables but only 2 are important to us. We do not consider undocumented arrivals in the country.

| Variable | Meaning |
|---|---|
| States | State where the number of immigrants was measured |
| Foreign born | Number of foreign-born US residents |

2. Exploratory Data Analysis

2.1 Type of crimes dataset

2.1.1 How has the crime rate evolved in the USA?

The first striking thing we notice on this graph is that there are two distinct trends: from 1985 to 1992 the violent crime rate increases, while from 1992 to 2020 it decreases overall.

In fact, crime increased enormously from World War II until the early 1990s and then declined sharply, due in part to the strong political commitment to law enforcement of President Bill Clinton (Democratic Party), who hired more police and provided more funding to crime control institutions.

Increased immigration, higher wages, changing demographics in the country, and abortion rights also explain this decrease in crime, as explained on this web page.

2.1.2 Which states have the highest crime rates?

The states chosen here have the highest crime rates in 2020. We found it interesting to see whether this has always been the case. In this graph, we can see that between 1990 and 1995 the violent crime rate tends to increase. Then, from 1995 to 2000, all 10 states experienced a decrease in the crime rate. From 2013, the rate increases again, largely because of the revised rape definition.

Focusing on Alaska and Tennessee, which have the first and third highest crime rates in 2020 respectively, we can see from the trend curves that in the 1990s these two states were average compared to the others. Conversely, New Mexico, which is in second place in 2020, already had a very high crime rate in the 1990s. For New Mexico, this can potentially be explained by the wave of Latin American migration (from Mexico) that the USA experienced from the 1970s and 1980s. It is also interesting to see that states like Louisiana and South Carolina, which were at the top of the list in the 1990s, have significantly reduced their crime rates.

2.1.3 Which states have the lowest crime rates?

The 10 states chosen here are those with the lowest crime rates in 2020. We thought it would be interesting to analyze their evolution. New Jersey and Connecticut are the states that experienced the biggest drop in violent crime rates between the 1990s and 2000. All the others remained more or less the same.

This can be explained by changes in laws, the renewal of the police force, and a different approach to crime. For example, New Jersey had to reform its entire police force because of corruption cases. The new forces in place are trained primarily in de-escalation and dialogue during interventions.

In the case of Connecticut, it is mainly the reclassification of crimes as felonies and the granting of greater discretionary power to judges (decisions left to the judge's own assessment) that led to a significant reduction in the crime rate.

2.1.4 Which states have the highest crime rates over the 2015-2020 period?



The barplot above shows the average violent crime rate by state between 2015 and 2020. Even when averaging across 2015-2020, the three states with the highest crime rates remain the same as before: Alaska, New Mexico and Tennessee.




Here is an interactive map that visualizes the same data as the previous barplot. Thanks to the color code, we can see which states have the highest average crime rate between 2015 and 2020, as well as their geographical location. The map makes it clear that the states with the highest crime rates are only a minority of the country (only 9 out of 50).

2.1.5 What are the most frequent crimes?

From the data collected, it is clear that between 2015 and 2020 the most common violent crime is aggravated assault, followed by robbery, rape and homicide.

2.1.6 How have they evolved over time?



In the four graphs, one for each type of violent crime, we can see that robbery, aggravated assault and homicide follow the same trend: each fell by more than half between 1990 and 2010. Several theories have been proposed to explain this sharp decrease: the improvement of police strategies and, more generally, of the police force; and wider access to abortion, which, according to this theory, reduced the number of children born to poor, single teenage mothers. Roughly speaking, it is assumed that these children were the most likely to become delinquents.

Another theory points to the Obama presidency: according to it, the fact that the new president was a man of color led to a decrease in crimes committed by people of color.

However, the rates of homicide and aggravated assault increased around 2013-2014. Depending on the state, this is explained by a return of gang warfare, drugs and gun sales.

Concerning the evolution of the rape rate, recall that the definition used to record rape in the United States has changed. Previously, rape only covered the forced penetration of a woman’s vagina by a man’s penis. Today, the definition is as follows: “The penetration, no matter how slight, of the vagina or anus with any body part or object, or oral penetration by a sex organ of another person, without the consent of the victim.” This explains the strong increase between 2013 and today.

2.2 Formation of State clusters

2.2.1 Find the ideal number of clusters


In order to compare states more efficiently, we decided to apply k-means clustering on four dimensions: Robbery, Homicides, Aggravated Assaults and Rape. We first scaled the data and then added the state names as row names.
To determine the ideal number of clusters, we used the adjusted R-squared criterion of Optimal_Clusters_KMeans.
The adjusted R-squared \(R^2_a\) increases with the number of clusters, meaning that the clustering explains a larger share of the variance in the data.
Here we see that from 4 clusters onwards, the adjusted R-squared is over 80%. Because its values are far higher than those of the other states, Alaska forms a cluster on its own. To obtain the most relevant model possible, we therefore decided to use 5 clusters, which we represent using fviz_cluster.
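
The procedure described above can be sketched with base R only. This is a minimal illustration under stated assumptions, not the code used in the report: the data are simulated, the state names are placeholders, and the criterion shown is the plain (unadjusted) share of variance explained, `betweenss / totss`, used here as a stand-in for the adjusted R-squared reported by Optimal_Clusters_KMeans.

```r
# Minimal sketch: share of variance explained by k-means for k = 2..8.
# The data are simulated; they are NOT the FBI rates used in this report.
set.seed(42)
crime <- data.frame(
  Robbery            = rnorm(50, mean = 80,  sd = 30),
  Homicides          = rnorm(50, mean = 6,   sd = 2),
  Aggravated_assault = rnorm(50, mean = 250, sd = 90),
  Rape               = rnorm(50, mean = 40,  sd = 12)
)
rownames(crime) <- paste0("State_", seq_len(nrow(crime)))  # placeholder names

crime_scaled <- scale(crime)  # centre each column and divide by its sd

# Between-cluster sum of squares divided by total sum of squares
ks <- 2:8
explained <- sapply(ks, function(k) {
  fit <- kmeans(crime_scaled, centers = k, nstart = 25)
  fit$betweenss / fit$totss
})
names(explained) <- ks
print(round(explained, 3))

# Final clustering with the number of clusters retained in the text
km_sketch <- kmeans(crime_scaled, centers = 5, nstart = 25)
print(table(km_sketch$cluster))  # number of states per cluster
```

Using several random starts (`nstart = 25`) makes the result less sensitive to the initial centres, which matters when one observation is as extreme as Alaska.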

2.2.2 Group the States according to their crime behavior

In the PCA plot of the variables, we see that the variables Rape_Scaled and Aggravated_Assaults_Scaled are correlated, as are the variables Homicides_Scaled and Robbery_Scaled.
In the graph representing the five clusters, dimension 1 mainly reflects the correlated variables Homicides_Scaled and Robbery_Scaled, and dimension 2 mainly reflects the correlated variables Rape_Scaled and Aggravated_Assaults_Scaled.


km.res[["centers"]]
##   Robbery_scaled Homicides_scaled Aggravated_assault_scaled Rape_scaled
## 1      0.7881724        1.4176379                1.48108088   0.4370596
## 2      1.1558337        1.0691436                2.75152733   4.5779411
## 3      0.8194608        0.4437583               -0.03728383  -0.5420362
## 4     -0.6686191       -0.8708740               -0.88735816  -0.3830740
## 5     -0.7665022       -0.3274591                0.21047178   0.7121579

Through the analysis of the scaled data, we assigned certain characteristics to the clusters:
- Cluster 1 is characterized by a Homicide rate higher than most (highest scaled average).
- Cluster 2 is characterized by a Robbery rate higher than most (highest scaled average), an Aggravated Assault rate higher than most (highest scaled average), and a Rape rate higher than most (highest scaled average).
- Cluster 3 is characterized by a Rape rate lower than most (lowest scaled average).
- Cluster 4 is characterized by a Robbery rate lower than most (second lowest scaled average), a Homicide rate lower than most (lowest scaled average), and an Aggravated Assault rate lower than most (lowest scaled average).
- Cluster 5 is characterized by a Robbery rate lower than most (lowest scaled average).
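
The characterization above can be checked directly from the cluster centres. The following sketch copies the values printed by `km.res[["centers"]]` into a matrix (the object name `centers` is ours) and looks up, for each crime type, which cluster has the highest and the lowest scaled average.

```r
# Cluster centres copied from the km.res[["centers"]] output above
centers <- matrix(
  c( 0.7881724,  1.4176379,  1.48108088,  0.4370596,
     1.1558337,  1.0691436,  2.75152733,  4.5779411,
     0.8194608,  0.4437583, -0.03728383, -0.5420362,
    -0.6686191, -0.8708740, -0.88735816, -0.3830740,
    -0.7665022, -0.3274591,  0.21047178,  0.7121579),
  nrow = 5, byrow = TRUE,
  dimnames = list(as.character(1:5),
                  c("Robbery_scaled", "Homicides_scaled",
                    "Aggravated_assault_scaled", "Rape_scaled"))
)

highest <- apply(centers, 2, which.max)  # cluster with the largest centre
lowest  <- apply(centers, 2, which.min)  # cluster with the smallest centre
print(rbind(highest, lowest))
```

The highest centres belong to cluster 2 for robbery, aggravated assault and rape, and to cluster 1 for homicides; the lowest belong to cluster 5 for robbery, cluster 4 for homicides and aggravated assault, and cluster 3 for rape, in line with the list above.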

2.3 Macroeconomic variables Dataset

2.3.1 Macroeconomic variables behavior depending on clusters

We would like to characterize each of the five clusters using the mean of each macro-variable. We created six barplots to facilitate the analysis:

In the first barplot, cluster 3 stands out from the other four: the average number of foreign-born residents across the 14 states in the cluster is 12,271 per 100,000. Cluster 5 has a low average of 5,255.

Concerning the average number of firearms licenses, cluster 2, which contains the state of Alaska, is much higher than the others with an average of 113 per 100,000 inhabitants. Cluster 3 has the lowest average of 34.

The average unemployment rates of the 5 clusters are quite similar. Nevertheless, cluster 2 (Alaska) stands out with a rate of 6,400 per 100,000, while cluster 5 has the lowest average, at 4,000.

The police officer averages are fairly similar, but we can see that Cluster 1 has a slightly higher average than the others. Clusters 2 and 4 have almost the same average.

Regarding the poverty rate per 100,000 inhabitants, cluster 1 clearly has the highest average of 15675. In addition, clusters 2 and 4 again have almost the same average, as do clusters 3 and 5.

Clusters 1 and 3 have the highest income inequality averages, cluster 2 has the lowest. Clusters 4 and 5 have similar averages.

2.3.2 Macroeconomic variables Scatterplots

In this section, we have mainly plotted graphs showing distribution of the crime rate depending on each macroeconomic variable.

Distribution of Foreign_Born


It is difficult to identify a trend from the scatterplot alone. Once the regression line is added, however, we see that the relationship between the two variables is negative. This result is surprising, because we expected a higher immigration rate to be associated with more delinquency. Clusters 1 and 3, taken on their own, show a positive trend.

Distribution of Weapons_licences


The scatterplot shows a predominantly positive relationship between the two variables, which is confirmed by the regression line: states where more firearms licenses are issued tend to have a higher crime rate. The slope is not very steep, however, and it would be interesting to see what impact this variable has on the different types of crime.

Distribution of Unemployment rate


The regression line is increasing, so the unemployment rate is positively related to the crime rate, especially for cluster 1, whose points show a clearly positive trend.

Distribution of Officers


As in the previous graph, the number of law enforcement officers seems to be positively related to the crime rate. This is quite surprising: intuitively, one would think that the more law enforcement there is, the less crime there is. The points within each cluster are more scattered than in the previous graphs.

Distribution of Poverty rate


This graph shows an increasing regression line with a steep slope. It can be argued that the poverty rate has a strong positive influence on the crime rate.

Distribution of Income Inequality


The regression line is increasing, but its slope is quite low. The effect of income inequality on each of the four types of crime would need to be studied to clarify this relationship.


2.4 Assumptions

2.4.1 Assumption on what impacts the Robbery rate

What we know from the above analysis: Cluster 4 has a low average robbery rate and a low unemployment rate. Cluster 5 also has a low average robbery rate and a low unemployment rate. Both clusters also have almost the same average income inequality. Cluster 1 has a high average robbery rate and a high unemployment rate.

Therefore, it can be hypothesized that the robbery rate in a state is influenced by the unemployment rate and by income inequality.

2.4.2 Assumption on what impacts the Homicides rate

What we know from the above analysis: Cluster 1 has the highest average homicide rate, a low immigration rate, a high poverty rate, and a high law enforcement rate. Cluster 4 has a low average homicide rate, a high immigration rate, a low poverty rate, and a low law enforcement rate.

Therefore, it can be hypothesized that the homicide rate in a state is influenced by the immigration rate, poverty rate, and law enforcement membership rate.

2.4.3 Assumption on what impacts the Aggravated Assault rate

What we know from the analysis above: Cluster 2 has high average aggravated assaults, high rates of firearms licenses issued, and high unemployment. Cluster 4 has a low average of aggravated assaults, and low rates of firearms licenses issued, and unemployment.

Thus, it can be hypothesized that the rate of aggravated assaults in a state is influenced by the rate of firearm licenses issued and the unemployment rate.

2.4.4 Assumption on what impacts the Rape rate

What we know from the analysis above: Cluster 2 has a high average rape rate, high rate of firearms licenses issued, low income inequality. Cluster 3 has a low rate of rape, a low rate of firearms licenses issued, and a high rate of income inequality.

Thus, it can be hypothesized that the rate of rape in a state is influenced by the rate of firearms licenses issued and the rate of income inequality.

3. Modeling


First, we decide to remove Alaska as an outlier since this state clearly stands out from the others due to its particularly high violent crime rates as we saw in the exploratory data analysis.

3.1 Determine the robbery rate through the macro variables

We started by calculating the correlation matrix of this model:

corr_matrixRobb
Correlation matrix
| | Robbery | Foreign_Born21 | Weapons_licences | Unemployement_rate | Officers | Income_Inequal |
|---|---|---|---|---|---|---|
| Robbery | 1.000 | 0.525 | -0.579 | 0.602 | 0.141 | 0.460 |
| Foreign_Born21 | 0.525 | 1.000 | -0.582 | 0.636 | -0.036 | 0.481 |
| Weapons_licences | -0.579 | -0.582 | 1.000 | -0.429 | -0.035 | -0.420 |
| Unemployement_rate | 0.602 | 0.636 | -0.429 | 1.000 | 0.103 | 0.707 |
| Officers | 0.141 | -0.036 | -0.035 | 0.103 | 1.000 | 0.454 |
| Income_Inequal | 0.460 | 0.481 | -0.420 | 0.707 | 0.454 | 1.000 |

We then estimate the following regression, which includes all the macro-variables, to explain the robbery rate:

\(Robbery =\beta_0 + \beta_1* ForeignBorn21 + \beta_2 * WeaponsLicences + \beta_3 * UnemployementRate + \\\beta_4 * Officers + \beta_5 * PovRate +\beta_6 * IncomeInequal\)


We find an \(R^2\) of 0.5244, which means that 52% of the variation in the robbery rate is explained by the linear relationship with the macro variables.

Backward selection based on AIC drops the Officers variable (the number of police employees). Our final regression is then:

\(Robbery =\beta_0 + \beta_1* ForeignBorn21 + \beta_2 * WeaponsLicences + \beta_3 * UnemployementRate + \\ \beta_4 * PovRate +\beta_5 * IncomeInequal\)

summary(regRobbery)
## 
## Call:
## lm(formula = Robbery ~ Foreign_Born21 + Weapons_licences + Unemployement_rate + 
##     Pov_rate + Income_Inequal, data = Bigdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.093 -11.215   2.608  10.705  53.207 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        27.0311129 25.9726415   1.041   0.3038  
## Foreign_Born21      0.0014237  0.0008696   1.637   0.1089  
## Weapons_licences   -0.3090258  0.1292561  -2.391   0.0213 *
## Unemployement_rate  0.0081597  0.0037536   2.174   0.0353 *
## Pov_rate            0.0036082  0.0016160   2.233   0.0308 *
## Income_Inequal     -0.0032709  0.0022878  -1.430   0.1600  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.39 on 43 degrees of freedom
## Multiple R-squared:  0.5423, Adjusted R-squared:  0.4891 
## F-statistic: 10.19 on 5 and 43 DF,  p-value: 1.738e-06

Of the five variables retained, three are statistically significant at the 5% level. We therefore consider that Weapons licences, Unemployment rate and Poverty rate have a relevant impact on the robbery rate in the United States. Removing the variables whose coefficients are not statistically different from 0 could, however, change or cancel the estimated impact of the others on the dependent variable. We therefore keep all the variables retained by the AIC-based backward selection.
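
The selection step can be illustrated with `step()` from base R. This is a minimal sketch on simulated data, assuming a data frame `sim` whose column names mirror those of the report; the values and the data-generating process are invented for the illustration and are not the report's dataset.

```r
# Minimal sketch of backward selection by AIC on simulated data.
# Column names mirror the report; the values are random, not the real data.
set.seed(1)
n <- 49
sim <- data.frame(
  Foreign_Born21     = rnorm(n, 9000, 4000),
  Weapons_licences   = rnorm(n, 50, 20),
  Unemployement_rate = rnorm(n, 5000, 800),
  Officers           = rnorm(n, 200, 40),
  Pov_rate           = rnorm(n, 12000, 2500),
  Income_Inequal     = rnorm(n, 4500, 300)
)
# Hypothetical data-generating process: Officers has no effect on Robbery
sim$Robbery <- 20 - 0.3 * sim$Weapons_licences +
  0.008 * sim$Unemployement_rate + 0.004 * sim$Pov_rate +
  rnorm(n, sd = 15)

full_model    <- lm(Robbery ~ ., data = sim)  # all six predictors
reduced_model <- step(full_model, direction = "backward", trace = 0)

print(formula(reduced_model))  # predictors kept by the AIC criterion
print(c(full = AIC(full_model), reduced = AIC(reduced_model)))
```

`step()` only removes a variable when doing so lowers the AIC, so the reduced model never has a higher AIC than the full one. A variable can be kept by this criterion without being individually significant at the 5% level, which is the situation described above.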


We check whether there is a multicollinearity issue:

| Variables | Tolerance | VIF |
|---|---|---|
| Foreign_Born21 | 0.286 | 3.494 |
| Weapons_licences | 0.606 | 1.649 |
| Unemployement_rate | 0.376 | 2.659 |
| Pov_rate | 0.424 | 2.361 |
| Income_Inequal | 0.318 | 3.148 |


We have no VIF greater than 5, so we assume that multi-collinearity is not a concern.
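
For reference, the two columns of the table carry the same information. Under the standard definitions, where \(R_j^2\) is the \(R^2\) obtained by regressing predictor \(j\) on all the other predictors:

\(VIF_j = \frac{1}{1 - R_j^2}, \qquad Tolerance_j = 1 - R_j^2 = \frac{1}{VIF_j}\)

For Foreign_Born21, for example, \(1 / 0.286 \approx 3.50\), which matches the reported VIF of 3.494 up to rounding and corresponds to \(R_j^2 \approx 0.71\): about 71% of the variation in this predictor is explained by the other four.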

Accuracy of the Linear Model
| RMSE | MAE | MASE |
|---|---|---|
| 18.16062 | 14.38561 | 0.6619215 |


The model has an MAE of about 14.4: on average, the robbery rate predicted for a state differs from the observed rate by roughly 14 robberies per 100,000 inhabitants.
The points on the QQ-plot do not all fall on the reference line, but they do not stray far from it. The residuals therefore deviate somewhat from normality, but the model remains usable, since there is a clear relationship between the independent variables and the dependent variable.
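
As a reminder of how the first two accuracy measures are read, here are their standard definitions, with \(y_i\) the observed rate of state \(i\), \(\hat{y}_i\) the fitted rate, and \(n = 49\) states:

\(MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|, \qquad RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\)

Both are expressed in the unit of the dependent variable, here robberies per 100,000 inhabitants. The RMSE is always at least as large as the MAE because it penalizes large errors more heavily. The RMSE also agrees with the residual standard error in the regression output once the degrees of freedom are accounted for: \(18.16 \cdot \sqrt{49/43} \approx 19.39\).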

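Our first hypothesis was that the robbery rate depended on the unemployment rate and income inequality. Our model supports this hypothesis for the unemployment rate, which is statistically significant; income inequality is retained by the AIC selection but is not significant on its own. In addition, the immigration rate, the number of firearms licenses issued and the poverty rate are also retained as determinants of the robbery rate.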

3.2 Determine the Homicides rate through the macro variables

We started by calculating the correlation matrix of this model:

corr_matrixHomi
Correlation matrix
| | Homicides | Foreign_Born21 | Weapons_licences | Unemployement_rate | Officers | Income_Inequal |
|---|---|---|---|---|---|---|
| Homicides | 1.000 | -0.214 | -0.203 | 0.156 | 0.430 | 0.350 |
| Foreign_Born21 | -0.214 | 1.000 | -0.582 | 0.636 | -0.036 | 0.481 |
| Weapons_licences | -0.203 | -0.582 | 1.000 | -0.429 | -0.035 | -0.420 |
| Unemployement_rate | 0.156 | 0.636 | -0.429 | 1.000 | 0.103 | 0.707 |
| Officers | 0.430 | -0.036 | -0.035 | 0.103 | 1.000 | 0.454 |
| Income_Inequal | 0.350 | 0.481 | -0.420 | 0.707 | 0.454 | 1.000 |

We then estimate the following regression, which includes all the macro-variables, to explain the homicide rate:

\(Homicides =\beta_0 + \beta_1* ForeignBorn21 + \beta_2 * WeaponsLicences + \beta_3 * UnemployementRate + \\ \beta_4 * Officers + \beta_5 * PovRate +\beta_6 * IncomeInequal\)


We find an \(R^2\) of 0.5811, which means that 58% of the variation in the homicide rate is explained by the linear relationship with the macro variables.

Backward selection based on AIC drops the Unemployment rate and Income inequality variables. Our final regression is then:

\(Homicides =\beta_0 + \beta_1* ForeignBorn21 + \beta_2 * WeaponsLicences + \beta_3 * Officers + \\ \beta_4 * PovRate\)

summary(regHomicides)
## 
## Call:
## lm(formula = Homicides ~ Foreign_Born21 + Weapons_licences + 
##     Officers + Pov_rate, data = Bigdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7030 -1.3072 -0.1916  1.3940  5.2785 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.7163060  2.4968823  -1.088   0.2826    
## Foreign_Born21   -0.0001262  0.0000647  -1.951   0.0575 .  
## Weapons_licences -0.0320626  0.0136464  -2.350   0.0233 *  
## Officers          0.0174085  0.0098143   1.774   0.0830 .  
## Pov_rate          0.0006448  0.0001299   4.962 1.09e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.048 on 44 degrees of freedom
## Multiple R-squared:  0.5923, Adjusted R-squared:  0.5552 
## F-statistic: 15.98 on 4 and 44 DF,  p-value: 3.763e-08
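
As a check on the output above, the adjusted R-squared can be recomputed from the multiple R-squared with the standard formula. Here \(n = 49\) observations (44 residual degrees of freedom plus 5 estimated coefficients) and \(p = 4\) predictors:

\(R^2_a = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - 0.5923) \cdot \frac{48}{44} \approx 0.5552\)

The penalty term \(\frac{n - 1}{n - p - 1}\) grows with the number of predictors, which is why the adjusted value is the fairer one to use when comparing models of different sizes.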

Of the four variables included in the regression, Weapons licences and the Poverty rate are statistically significant, at the 5% and 0.1% levels respectively; Foreign_Born21 and Officers are significant only at the 10% level. The intercept of this regression is negative (-2.7). This may seem worrying, since a homicide rate cannot be negative, but it is not a problem: the intercept is the rate predicted when every macro variable equals zero, a situation far outside the range of our data, and it is not statistically different from 0 (p = 0.28).

We check whether there is a multicollinearity issue:

| Variables | Tolerance | VIF |
|---|---|---|
| Foreign_Born21 | 0.577 | 1.733 |
| Weapons_licences | 0.607 | 1.647 |
| Officers | 0.835 | 1.198 |
| Pov_rate | 0.731 | 1.367 |


We have no VIF greater than 5, so we assume that multi-collinearity is not a concern.


Accuracy of the Linear Model
| RMSE | MAE | MASE |
|---|---|---|
| 1.940874 | 1.562546 | 0.6253413 |

The points on the QQ-plot do not all fall on the reference line, but they do not stray far from it. The residuals therefore deviate somewhat from normality, but the model remains usable, since there is a clear relationship between the independent variables and the dependent variable.

Our second hypothesis was that the homicide rate depended on the immigration rate, the poverty rate, and the law enforcement rate. Our regression model retains all three variables, which is consistent with this hypothesis, but the number of firearms licenses issued also affects the homicide rate.

3.3 Determine the Aggravated Assault rate through the macro variables

We started by calculating the correlation matrix of this model:

corr_matrixAssault
Correlation matrix
| | Aggravated_assault | Foreign_Born21 | Weapons_licences | Unemployement_rate | Officers | Income_Inequal |
|---|---|---|---|---|---|---|
| Aggravated_assault | 1.000 | -0.181 | -0.004 | 0.045 | 0.292 | 0.202 |
| Foreign_Born21 | -0.181 | 1.000 | -0.582 | 0.636 | -0.036 | 0.481 |
| Weapons_licences | -0.004 | -0.582 | 1.000 | -0.429 | -0.035 | -0.420 |
| Unemployement_rate | 0.045 | 0.636 | -0.429 | 1.000 | 0.103 | 0.707 |
| Officers | 0.292 | -0.036 | -0.035 | 0.103 | 1.000 | 0.454 |
| Income_Inequal | 0.202 | 0.481 | -0.420 | 0.707 | 0.454 | 1.000 |

We then estimate the following regression, which includes all the macro-variables, to explain the aggravated assault rate:

\(Assault =\beta_0 + \beta_1* ForeignBorn21 + \beta_2 * WeaponsLicences + \beta_3 * UnemployementRate + \\\beta_4 * Officers + \beta_5 * PovRate +\beta_6 * IncomeInequal\)


We find an \(R^2\) of 0.3373, which means that 34% of the variation in the aggravated assault rate is explained by the linear relationship with the macro variables.

Backward selection based on AIC drops every variable except the Poverty rate. Our final regression is then:

\(Assault =\beta_0 + \ \beta_1 * PovRate\)

summary(regAssault)
## 
## Call:
## lm(formula = Aggravated_assault ~ Pov_rate, data = Bigdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -262.716  -64.328   -4.949   34.947  244.526 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -71.908152  68.558458  -1.049      0.3    
## Pov_rate      0.027249   0.005334   5.109 5.81e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 98.31 on 47 degrees of freedom
## Multiple R-squared:  0.3571, Adjusted R-squared:  0.3434 
## F-statistic:  26.1 on 1 and 47 DF,  p-value: 5.815e-06
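
To make the fitted equation concrete, here is a worked example using the coefficients from the output above. The poverty rate of 12,000 per 100,000 inhabitants (12%) is a hypothetical value chosen for illustration, not a figure taken from the data:

\(\widehat{Assault} = -71.908 + 0.027249 \cdot PovRate = -71.908 + 0.027249 \cdot 12000 \approx 255.1\)

That is, about 255 aggravated assaults per 100,000 inhabitants. Under this model, an increase of 1,000 in the poverty rate (one percentage point) is associated with roughly 27 additional aggravated assaults per 100,000 inhabitants.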



Accuracy of the Linear Model
| RMSE | MAE | MASE |
|---|---|---|
| 96.28532 | 70.70843 | 0.7543343 |

We note that five clearly identifiable points stand out on the red curve. Again, some states have lower than average rates of aggravated assault (Maine, New Hampshire, Connecticut), others higher (New Mexico, Tennessee). The rest of the points are very close to the curve.

Our third hypothesis was that the rate of aggravated assaults depended on the number of firearms licenses issued and the unemployment rate. Our model does not support it: after selection, the aggravated assault rate is explained by the poverty rate alone.

3.4 Determine the Rape rate through the macro variables

We started by calculating the correlation matrix of this model:

corr_matrixRape
Correlation matrix
                   Rape_net Foreign_Born21 Weapons_licences Unemployement_rate Officers Income_Inequal
Rape_net              1.000         -0.370            0.468             -0.263   -0.056         -0.352
Foreign_Born21       -0.370          1.000           -0.582              0.636   -0.036          0.481
Weapons_licences      0.468         -0.582            1.000             -0.429   -0.035         -0.420
Unemployement_rate   -0.263          0.636           -0.429              1.000    0.103          0.707
Officers             -0.056         -0.036           -0.035              0.103    1.000          0.454
Income_Inequal       -0.352          0.481           -0.420              0.707    0.454          1.000
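
A correlation matrix of this shape can be obtained with `cor()`. The sketch below uses a synthetic data frame `demo` with random values, so only the column names are taken from the report; the name `corr_matrix_demo` is hypothetical.

```r
# Synthetic stand-in data: random values, column names as in the report.
set.seed(2)
n <- 49
demo <- data.frame(
  Rape_net           = rnorm(n),
  Foreign_Born21     = rnorm(n),
  Weapons_licences   = rnorm(n),
  Unemployement_rate = rnorm(n),
  Officers           = rnorm(n),
  Income_Inequal     = rnorm(n)
)

# Pearson correlations between all pairs of columns, rounded to 3 decimals.
# use = "complete.obs" drops rows with missing values before computing.
corr_matrix_demo <- round(cor(demo, use = "complete.obs"), 3)
corr_matrix_demo
```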

We decided to calculate the following regression, which considers all the macro variables presented above, to determine the Rape rate:

\(Rape =\beta_0 + \beta_1* ForeignBorn21 + \beta_2 * WeaponsLicences + \beta_3 * UnemployementRate + \\ \beta_4 * Officers + \beta_5 * PovRate +\beta_6 * IncomeInequal\)


We find an \(R^2\) of 0.4257, which means that about 43% of the variation in the rape rate is explained by the linear relationship with the macro variables.

Backward selection based on AIC removes the ForeignBorn and Officers variables. Our final regression is then:

\(Rape =\beta_0 + \beta_1 * WeaponsLicences + \beta_2 * UnemployementRate + \\ \beta_3 * PovRate +\beta_4 * IncomeInequal\)

summary(regRape)
## 
## Call:
## lm(formula = Rape_net ~ Weapons_licences + Unemployement_rate + 
##     Pov_rate + Income_Inequal, data = Bigdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.2484  -7.2032  -0.8501   3.7470  24.2894 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        49.3576978 13.5252639   3.649 0.000694 ***
## Weapons_licences    0.1571027  0.0598584   2.625 0.011884 *  
## Unemployement_rate  0.0017767  0.0017675   1.005 0.320311    
## Pov_rate            0.0021888  0.0006554   3.340 0.001717 ** 
## Income_Inequal     -0.0032506  0.0011215  -2.898 0.005828 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.16 on 44 degrees of freedom
## Multiple R-squared:  0.4023, Adjusted R-squared:  0.3479 
## F-statistic: 7.403 on 4 and 44 DF,  p-value: 0.0001193
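
As a consistency check on the output above, the adjusted \(R^2\) can be recovered from the multiple \(R^2\), with \(n = 49\) observations (44 residual degrees of freedom plus 5 estimated coefficients) and \(p = 4\) predictors:

\[
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - 0.4023) \cdot \frac{48}{44} \approx 0.348
\]

This agrees with the reported value of 0.3479 up to rounding.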

In the final regression of the rape rate, three of the four retained variables are statistically significant, at different levels: weapons licences (5% level), poverty rate and income inequality (both at the 1% level). These three variables are therefore associated with the rape rate. The unemployment rate is not significant (p = 0.32).
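
For reference, each t value in the coefficient table is the estimate divided by its standard error. For Weapons_licences, for example:

\[
t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{0.1571027}{0.0598584} \approx 2.625
\]

The p-value is then the two-sided tail probability of a t distribution with 44 degrees of freedom.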

We check whether there is a multi-collinearity issue:

Variables Tolerance VIF
Weapons_licences 0.777 1.287
Unemployement_rate 0.466 2.144
Pov_rate 0.708 1.413
Income_Inequal 0.363 2.752


We have no VIF greater than 5, so we assume that multi-collinearity is not a concern.
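
The tolerance and VIF columns are linked by Tolerance = 1 / VIF (for example, 1 / 1.287 ≈ 0.777). A minimal base-R sketch of how such a table can be computed follows; the helper `vif_by_hand` and the predictors `x1`, `x2`, `x3` are hypothetical, and the data are synthetic.

```r
# Synthetic predictors; x2 is deliberately correlated with x1.
set.seed(3)
n <- 49
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared obtained when
# predictor j is regressed on all the other predictors.
vif_by_hand <- function(df) {
  sapply(names(df), function(v) {
    f  <- reformulate(setdiff(names(df), v), response = v)
    r2 <- summary(lm(f, data = df))$r.squared
    1 / (1 - r2)
  })
}

vif <- vif_by_hand(predictors)
data.frame(
  Variables = names(vif),
  Tolerance = round(1 / vif, 3),
  VIF       = round(vif, 3),
  row.names = NULL
)
```

The threshold of 5 used above is a rule of thumb; some authors prefer a stricter or looser cut-off.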

Accuracy of the Linear Model
RMSE MAE MASE
9.632063 7.460803 0.7593315

The QQ-plot shows some anomalies, which correspond to states with higher than average rape rates, such as Arkansas, South Dakota, Colorado and Michigan. The other points are on the curve or very close to it.
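
One way to identify which observations depart from the QQ line is to inspect the standardized residuals. The sketch below is an illustration on synthetic data, not the report's procedure: the data frame `demo`, the labels `State_1` to `State_49`, and the cut-off of 2 are all assumptions.

```r
# Synthetic data with two deliberately inflated responses.
set.seed(4)
n <- 49
demo <- data.frame(state = paste0("State_", seq_len(n)), x = rnorm(n))
demo$y <- 30 + 5 * demo$x + rnorm(n, sd = 8)
demo$y[c(3, 17)] <- demo$y[c(3, 17)] + 60   # inject two unusually high values

fit <- lm(y ~ x, data = demo)

# Standardized residuals, labelled by state.
std_res <- rstandard(fit)
names(std_res) <- demo$state

# QQ-plot of the standardized residuals with a normal reference line.
qqnorm(std_res)
qqline(std_res, col = "red")

# Observations whose standardized residual exceeds 2 in absolute value.
flagged <- sort(std_res[abs(std_res) > 2], decreasing = TRUE)
flagged
```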

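Our fourth hypothesis said that the rape rate depended on the number of firearms licenses issued and the rate of income inequality. Our model supports this hypothesis, since both variables are significant. The poverty rate is also significant, whereas the unemployment rate, although retained by the AIC-based selection, is not statistically significant (p = 0.32).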

3.5 Regression

Robbery Scatter Plot Matrix

Homicides Scatter Plot Matrix

Aggravated_assault Scatter Plot Matrix

Rape Scatter Plot Matrix

4. Conclusion

The crime rate varied widely between 1985 and 2020. Its increases were linked to the geopolitical situation of the United States, which was going through a period of crisis, and its decreases were often related to demographic and economic changes, such as increased immigration, wages, and police manpower. The rate has nevertheless been fairly stable since 2016, with exceptions: the COVID crisis, for example, significantly increased the homicide rate because of the quarantine and numerous protests, as shown in this study.

As seen from the bar plot in 2.1.5, the most common crime committed in the U.S. is aggravated assault. In our regression model, the poverty rate is the only variable retained for aggravated assault; it is highly significant and explains about 36% of the variation in the aggravated assault rate. Poverty is also associated with the other three types of crime, at different levels. One could then suggest ways to lower the poverty rate in order to lower the crime rate.

  • One of the particularities of the United States is that its health care system is failing and unequal: American citizens can only access health care if they can afford it; otherwise they go into debt, which increases the poverty rate and, in turn, the number of crimes committed. Revising the health care system to make it fairer would be a great step forward for the country.

  • In addition, American working life is based on the ideology of meritocracy. This leaves many citizens unemployed or in low-paying jobs, which contributes to the rising poverty rate. Creating new jobs should reduce both the poverty rate and the unemployment rate.

  • The problem of hate crimes and racism is also very present in some states, as we saw in 2020 with the “Black Lives Matter” movement following the murder of a citizen of color. The image of law enforcement took a hit: officers were no longer taken seriously or respected by citizens, and thousands of resignations were recorded in the sector. Awareness programs should be incorporated into schools to prevent hate crimes and to teach young people about their rights as citizens. It would also be important to remobilize law enforcement to lower the homicide rate.

  • Ensure equity from birth, regardless of social class and ethnicity. It would be necessary to allow citizens to access a proper level of education without going into debt.

Our study does not, of course, take into account unsolved or unrecorded crime and black market activities in the United States, such as undeclared employment, illegal gun carrying, and illegal immigration. Unfortunately, it is impossible to estimate these factors and incorporate them into the model. Nevertheless, they certainly have an impact on the rates we have estimated.

In addition, some states have much higher than average crime rates, such as Alaska, New Mexico, Tennessee, and Arkansas. It might make sense to set a maximum crime rate for each state; if a state exceeded it, its governor would be required to lower the rate by any means necessary.

During our exploratory analysis of the data, we noticed that the party (Democrat or Republican) in power at a given time appeared to have a large impact on the crime rate. Crime was peaking when Democrat Bill Clinton took office in 1993; the crime rate then dropped, thanks to the introduction of new laws and increased police staffing (as explained in section 2.1.1). It would be extremely interesting to add this variable to the model to see its impact on the different types of crime.